@reuvenlax
Contributor

@reuvenlax reuvenlax commented Oct 7, 2025

This PR fixes multiple fundamental issues in tableRowFromMessage. This function is usually used in the dead-letter output from BigQueryIO, though it has other uses as well.

  • If the BigQuery table has a column named "f" the existing code throws an exception. This is because the TableRow class has an implicit field named "f" which conflicts. The fix is to detect this case and format the TableRow using setF instead. We do this only if an "f" column is detected to maintain backwards compatibility with existing usage.
  • When converting a TableRow to a protocol buffer, specific field types (e.g. DATE, DATETIME) are encoded using an efficient encoding structure. The existing tableRowFromMessage function didn't know about these encodings, so when converting back to TableRow it simply copied the raw encoded data back. Users of TableRow don't know how to deal with these encodings, so this effectively corrupted the result. In order to properly decode these types, we had to extend the internal format function to take extra information about the BigQuery schema.
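
To illustrate the second point, here is a minimal sketch of schema-guided decoding. It assumes, for illustration only, that a DATE travels as a count of days since 1970-01-01 and a TIMESTAMP as microseconds since the epoch; the `ColumnType` enum and `decode` helper are hypothetical names, not the code in this PR, and the packed DATETIME/TIME encodings are not covered.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

class EncodedTypeSketch {
  /** Hypothetical subset of BigQuery column types, standing in for real schema information. */
  enum ColumnType { DATE, TIMESTAMP, INT64 }

  /**
   * Decodes a raw proto value using the column type from the schema.
   * Without the schema, a DATE and an INT64 are indistinguishable integers.
   */
  static Object decode(ColumnType type, long raw) {
    switch (type) {
      case DATE:
        // Assumed encoding: days since 1970-01-01.
        return LocalDate.ofEpochDay(raw).toString();
      case TIMESTAMP:
        // Assumed encoding: microseconds since the epoch.
        return Instant.EPOCH.plus(raw, ChronoUnit.MICROS).toString();
      default:
        // Plain integers are passed through unchanged.
        return raw;
    }
  }
}
```

Calling `decode(ColumnType.DATE, days)` yields an ISO date string, whereas copying `days` straight into the row would hand the user a bare integer, which is the corruption described above.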

In addition to fixing these bugs, many tests were added.

@gemini-code-assist
Contributor

Summary of Changes

Hello @reuvenlax, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refactors the tableRowFromMessage method to use TableRow.setF() for populating TableRow objects, improving correctness and efficiency. It also includes a minor cleanup by removing an unused import.

Highlights

  • Intent: The primary intent of this pull request is to refactor the tableRowFromMessage method in TableRowToStorageApiProto.java to use the TableRow.setF() method for populating TableRow objects. This change addresses a previous TODO comment in the code, indicating that setF() is the more correct and efficient way to construct TableRow instances.
  • Changes Summary: The core change is in sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/TableRowToStorageApiProto.java. The tableRowFromMessage method has been significantly refactored. Instead of iterating through message fields and calling tableRow.put() for each, it now constructs a List<TableCell> by iterating through the message's field descriptors. Each TableCell is populated with its value using setV() if the field exists and passes CDC column exclusion and field inclusion checks. Finally, the TableRow is created and populated using tableRow.setF(tableCells). Additionally, an unused import for java.util.Objects was removed from sdks/java/io/debezium/src/main/java/org/apache/beam/io/debezium/DebeziumReadSchemaTransformProvider.java.
  • Reviewer Activity Summary: No reviewer activity has been provided in the context.

@github-actions
Contributor

github-actions bot commented Oct 7, 2025

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers

@reuvenlax reuvenlax force-pushed the fix_reverse_f_issue branch from e5e58c1 to 214e73d on October 8, 2025 05:06
@liferoad liferoad requested a review from Abacn October 8, 2025 16:41
Contributor

@Abacn Abacn left a comment

Thank you!

@reuvenlax
Contributor Author

@Abacn Upon adding a unit test, I realized that there are some more fundamental (pre-existing) problems with this code: it corrupts certain types. Trying to fix this now.

@github-actions
Contributor

github-actions bot commented Oct 8, 2025

Assigning reviewers:

R: @m-trieu for label java.

Note: If you would like to opt out of this review, comment assign to next reviewer.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

@liferoad
Contributor

Fixes #33531

@liferoad
Contributor

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request is a significant refactoring to improve how BigQueryIO handles Protobuf messages, specifically to support field names like "f" which conflict with TableRow's internal API. The changes are extensive, introducing SerializableBiFunction to pass schema information, and refactoring TableRowToStorageApiProto to use a more robust setF-based approach for TableRow creation. The introduction of SerializableBiFunctions and new tests for various proto encodings are great additions. My review includes a few minor suggestions for code style and clarity in the new helper classes, and a question about a change in test coverage. Overall, this is a high-quality contribution that improves the robustness of BigQueryIO.
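
The "f" conflict and the fallback can be sketched in plain Java. This is a simplified stand-in, not the PR's code: `RowLike` is a hypothetical class that mimics a row type whose key "f" is reserved for a positional list of cells, and `toRow` is a hypothetical helper. The real code builds a list of TableCell objects and passes it to TableRow.setF, while this sketch stores raw values.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;

class FColumnSketch {
  /** Hypothetical stand-in for a row type that reserves the key "f" for its cell list. */
  static class RowLike extends LinkedHashMap<String, Object> {
    @Override
    public Object put(String key, Object value) {
      // Mimics the conflict: a scalar stored under "f" collides with the reserved field.
      if ("f".equals(key) && !(value instanceof List)) {
        throw new IllegalArgumentException("key \"f\" is reserved for the positional cell list");
      }
      return super.put(key, value);
    }
  }

  /**
   * Builds a row by column name, unless some column is named "f".
   * In that case it falls back to a positional list, so existing users
   * with no "f" column keep seeing name-keyed rows.
   */
  static RowLike toRow(List<String> columnNames, List<Object> values) {
    RowLike row = new RowLike();
    if (columnNames.contains("f")) {
      // Positional fallback: cell i holds the value of column i.
      row.put("f", new ArrayList<>(values));
    } else {
      for (int i = 0; i < columnNames.size(); i++) {
        row.put(columnNames.get(i), values.get(i));
      }
    }
    return row;
  }
}
```

Switching representation only when an "f" column is present is what keeps the change backwards compatible: callers that read values by name are affected only for tables that previously failed outright.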

@liferoad liferoad requested a review from Abacn October 11, 2025 13:58
@reuvenlax reuvenlax changed the title from "tableRowFromMessage should use setF" to "Fix issues in tableRowFromMessage" Oct 11, 2025
@reuvenlax
Contributor Author

/gemini review

@reuvenlax
Contributor Author

@Abacn @liferoad here is the updated PR. The issues were deeper than expected, hence the more involved PR (and all the new test coverage)

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces significant fixes to tableRowFromMessage to handle edge cases and improve data conversion correctness. Specifically, it addresses issues with BigQuery columns named "f" by introducing a fallback to using TableRow.setF, and it correctly decodes special BigQuery types (like DATE, DATETIME) from protocol buffers by leveraging schema information. The changes are extensive, refactoring many parts of the BigQueryIO write path to plumb through the necessary schema information, and introducing a SerializableBiFunction to formatting functions. The addition of SerializableBiFunctions provides useful utilities. The code changes look solid and are well-tested with new unit and integration tests. I have a couple of suggestions for improving exception messages and cleaning up test code.
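
One way to plumb an extra schema argument through without breaking existing one-argument formatting functions is an adapter like the sketch below. The interfaces `SerFunction` and `SerBiFunction` and the helper `fromFunction` are hypothetical stand-ins written for this example; the methods actually offered by SerializableBiFunctions in the PR are not shown in this thread.

```java
import java.io.Serializable;
import java.util.function.BiFunction;
import java.util.function.Function;

class BiFunctionSketch {
  /** Hypothetical one-argument serializable function, as legacy callers might supply. */
  interface SerFunction<A, R> extends Function<A, R>, Serializable {}

  /** Hypothetical two-argument serializable function; the second argument carries schema info. */
  interface SerBiFunction<A, B, R> extends BiFunction<A, B, R>, Serializable {}

  /** Adapts a legacy function by ignoring the extra schema argument. */
  static <A, B, R> SerBiFunction<A, B, R> fromFunction(SerFunction<A, R> fn) {
    return (value, ignoredSchema) -> fn.apply(value);
  }
}
```

With an adapter of this shape, internal code can always call the two-argument form, and user-supplied one-argument functions keep working unchanged.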

Comment on lines 1418 to 1552
throw new RuntimeException(
"Not implemented yet "
+ fieldDescriptor.getMessageType().getName()
+ " PARTIAL NAME "
+ fieldDescriptor.getMessageType().getName()
+ " FIELD NAME "
+ prefix
+ " CLASS TYPE "
+ fieldValue.getClass());
}
Contributor

medium

The exception message here seems to contain debugging information ("PARTIAL NAME", "FIELD NAME", "CLASS TYPE"). It would be better to provide a cleaner, more user-friendly message. Also, consider using UnsupportedOperationException instead of a generic RuntimeException for cases that are not implemented. This applies to similar throw new RuntimeException("Not implemented...") statements in this method for other types like BOOL, INT64, BYTES, and TIMESTAMP.

throw new UnsupportedOperationException(
    "Converting BigQuery 'DOUBLE' from a protobuf message of type '"
        + fieldDescriptor.getMessageType().getName()
        + "' is not supported. Field: "
        + prefix);

@Abacn
Contributor

Abacn commented Oct 14, 2025

Thanks. I understand there is a deeper issue in tableRowFromMessage that requires substantial changes. Given that 20+ tests are still breaking (https://github.com/apache/beam/pull/36425/checks?check_run_id=52537069798), I imagine it may take time to clean everything up. Alternatively, would it be acceptable to adopt the last reviewed version (214e73d) for 2.69.0?

cc: @liferoad WDYT?

@liferoad
Contributor

@reuvenlax
Contributor Author

Agree. We will get this in after the cut.

@reuvenlax reuvenlax force-pushed the fix_reverse_f_issue branch from a7bc70c to 4141b42 on October 19, 2025 23:12
@reuvenlax
Contributor Author

@Abacn all tests now passing

@reuvenlax
Contributor Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces significant fixes and refactorings to tableRowFromMessage to address fundamental issues with data type conversions and field name conflicts. The changes correctly handle BigQuery columns named "f" by using setF and ensure proper decoding of special types like DATE and DATETIME by leveraging schema information during conversion from protocol buffers. The introduction of SerializableBiFunction and its helpers is a good way to evolve the API while maintaining backward compatibility. The code is also made more maintainable by centralizing conversion logic. I've found a couple of potential issues related to locale-sensitivity in number formatting and incorrect timestamp formatting that could lead to bugs. Overall, this is a solid improvement.
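
The locale concern can be illustrated with a small sketch. The class and method names are hypothetical and this is not the code under review; it only shows why a number formatted without an explicit locale can change shape between machines.

```java
import java.util.Locale;

class LocaleSketch {
  /** Uses the JVM default locale, so the decimal separator depends on the machine. */
  static String localeSensitive(double value) {
    return String.format("%.3f", value);
  }

  /** Pins the locale, so the output always uses '.' as the decimal separator. */
  static String localeStable(double value) {
    return String.format(Locale.ROOT, "%.3f", value);
  }
}
```

On a JVM whose default locale uses a decimal comma (German, for example), the first method emits a comma while the second still emits a period. Values written into a TableRow as strings generally need the second, locale-independent form.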

Contributor

@Abacn Abacn left a comment

Thanks. I understand that the need to plumb SchemaInformation through led to the refactorings. I had one comment that I hope is useful. I have reviewed the SchemaInformation/getF fixes and am still going through the date-time changes.

@reuvenlax reuvenlax force-pushed the fix_reverse_f_issue branch from 8f98821 to 3e65c16 on October 28, 2025 18:59
Contributor

@Abacn Abacn left a comment

Thanks!

(I just ran /gemini review again to check for typos; in my experience the tool is good at finding this kind of issue.)

@Abacn
Contributor

Abacn commented Oct 30, 2025

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces significant and well-executed fixes to tableRowFromMessage, addressing fundamental issues with type decoding and handling of special field names. The introduction of SchemaInformation to guide the conversion from protocol buffers to TableRow is a crucial improvement for correctly handling BigQuery-specific types like DATE and DATETIME. The clever fallback mechanism to handle table columns named "f" by using setF is a great way to ensure backward compatibility while fixing the underlying issue. The refactoring of type conversion logic into a map of converters in TableRowToStorageApiProto greatly improves the code's structure and maintainability. The extensive addition of tests covering these fixes is commendable. I have one suggestion regarding a DateTimeFormatter implementation that could be improved.

@reuvenlax reuvenlax merged commit d46a013 into apache:master Oct 31, 2025
18 of 20 checks passed